Communities of vertices within a giant network such as the World-Wide Web are likely to be vastly
smaller than the network itself. However, Fortunato and Barthelemy have proved that modularity
maximization algorithms for community detection may fail to resolve communities with fewer than
$\sqrt{L/2}$ edges, where $L$ is the number of edges in the entire network. This resolution limit leads
modularity maximization algorithms to have notoriously poor accuracy on many real networks.
Fortunato and Barthelemy's argument can be extended to networks with weighted edges as well,
and we derive this corollary argument. We conclude that weighted modularity algorithms may fail
to resolve communities with fewer than $\sqrt{W\epsilon/2}$ total edge weight, where $W$ is the total edge
weight in the network and $\epsilon$ is the maximum weight of an inter-community edge. If $\epsilon$ is
small, then small communities can be resolved.

Given a weighted or unweighted network, we describe how to derive new edge weights in order to
achieve a low $\epsilon$. We modify the "CNM" community detection algorithm to maximize weighted
modularity, and show that the resulting algorithm has greatly improved accuracy. In experiments with
an emerging community standard benchmark, we find that our simple CNM variant is competitive
with the most accurate community detection methods yet proposed.

I. INTRODUCTION

Maximizing the modularity of a network, as defined by
Girvan and Newman [1], is perhaps the most popular and
cited paradigm for detecting communities in networks.
There are many algorithms for approximately maximizing
modularity and its variants, such as [2, 3, 4]. Community
assignments of good modularity feature groups
of nodes that are more tightly connected than would be
expected. We give the formal definition of modularity
below. Recent literature, however, has begun to focus on
paradigms other than modularity maximization. This is
in part due to Clauset, Newman, and Moore [5], who now
advocate a more general notion of "community" than
that associated with modularity. The shift away from
modularity maximization is also due to Fortunato and
Barthelemy [6], who prove that any community assignment
produced by a modularity maximization algorithm
will have predictable deficiencies in certain realistic
situations. Specifically, they argue that any solution of
maximum modularity will suffer from a resolution limit that
prevents small communities from being detected in large
networks. Furthermore, work by Dunbar [7] indicates
that true human communities are generally smaller than
150 nodes. This size is far below the resolution limit
inherent in many large networks, such as various social
networking sites on the World Wide Web.

We agree with Clauset, Newman, and Moore's [5] idea
that it is useful to consider more general definitions for
"community"; however, we maintain that it is still important
to detect traditional, tightly-connected communities
of nodes. In this paper, we revisit the negative result of
Fortunato and Barthelemy and analyze it in a different
light. We show that positive results are possible without
contradicting the resolution limit. The key is to apply
carefully computed weights to the edges of the network.

With one exception, previous methods for tolerating
this resolution limit require searching over an input
parameter. For example, Li, et al. [8] address the resolution
limit problem by defining a modularity alternative called
modularity density. Given a fixed number of communities
$k$, solving a k-means problem will maximize modularity
density. Li, et al. generalize modularity density so
that tuning a parameter $\lambda$ favors either small communities
(large $\lambda$) or large communities (small $\lambda$) [8]. Arenas,
Fernandez, and Gomez also address the problem of resolution
limits [3]. They provide the user with a parameter
$r$ that modifies the natural community sizes for modularity
maximization algorithms. By tuning $r$, they influence
the natural resolution limit. At certain values of $r$, small
communities will be natural, and at other values of $r$,
large communities will be natural. Our methods apply
without specifying any target scale for natural communities,
and resolve small and large communities simultaneously.

One solution that resolves communities at multiple
scales with no tuning parameter is the HQcut algorithm
of Ruan and Zhang [10]. This algorithm alternates between
spectral methods and efficient local improvement.
It uses a statistical test to determine whether to split each
community. Ruan and Zhang argue that a subnetwork
with modularity significantly greater than that expected
of a random network with the same sequence of vertex
degrees is likely to have sub-communities, and therefore
should be split. As Fortunato points out in his recent
survey, though, this stopping criterion is an ad-hoc
construction.

Nevertheless, Ruan and Zhang present compelling evidence
that the accuracy of HQcut often exceeds that of
competitors such as Newman's spectral method followed
by Kernighan-Lin local improvement [12] and the simulated
annealing method of Guimera and Amaral [13].
The HQcut solution is not simply the solution of global
maximum modularity, so it is not bound by the resolution
limit. We obtained the authors' Matlab code for HQcut
and we present comparisons with our approach below.



II. RESOLUTION LIMITS 

Fortunato and Barthelemy [6] define a module to be a
set of vertices with positive modularity:

$$\frac{l_s}{L} - \left(\frac{d_s}{2L}\right)^2 > 0, \tag{1}$$



where $l_s$ is the number of undirected edges (links) within
the set, $d_s$ is the sum of the degrees of the vertices within
the set, and $L$ is the number of undirected links in the
entire network. These modules contain more edges than
we would expect from a set of vertices with the same
degrees, were edges to be assigned randomly (respecting the
invariant vertex degrees). Let us define such modules to
be natural communities with respect to modularity
maximization. We say that a natural community is minimal
if it contains no other natural communities. We wish
to resolve the minimal natural communities, and we will
discuss this goal in Section VIII B.
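
To make the module test concrete, the following is a minimal sketch (not part of the original experiments) that checks whether a vertex set satisfies $l_s/L - (d_s/2L)^2 > 0$. The edge-list representation, the function name `is_module`, and the toy graph of two triangles joined by one edge are assumptions made purely for illustration.

```python
def is_module(edges, vertex_set):
    """Return True if vertex_set satisfies l_s/L - (d_s/(2L))**2 > 0.

    edges is a list of undirected (u, v) pairs with no duplicates.
    """
    members = set(vertex_set)
    L = len(edges)
    # l_s: edges with both endpoints inside the set.
    l_s = sum(1 for u, v in edges if u in members and v in members)
    # d_s: total degree of the vertices in the set.
    d_s = sum((u in members) + (v in members) for u, v in edges)
    return l_s / L - (d_s / (2 * L)) ** 2 > 0


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(is_module(edges, {0, 1, 2}))  # True: 3/7 - (7/14)^2 > 0
print(is_module(edges, {2, 3}))     # False: 1/7 - (6/14)^2 < 0
```

In this toy graph each triangle is a module, while the pair of vertices that spans the connecting edge is not.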

In order to ensure that such modules are resolved in a
global community assignment with maximum modularity,
Fortunato and Barthelemy [6] argue that the following
must hold:

$$l_s \geq \sqrt{\frac{L}{2}}. \tag{2}$$



They back up this mathematical argument with empiri- 
cal evidence. Even in a pathologically easy situation, in 
which the modules are cliques, and only one edge links 
any module to a neighboring module, the individual mod- 
ules will not be resolved in any solution of maximum 
modularity. Instead, several cliques will be merged into 
one module. Experiments show that the numbers of links
in the resulting modules closely track the prediction.

Work by Dunbar [7] indicates that true human communities
are generally limited to roughly 150 members,
and this is corroborated by the recent work of Leskovec,
Lang, Dasgupta, and Mahoney [14]. Such communities
will have dramatically fewer than $\sqrt{L/2}$ edges in practice.
Based on this argument, it would seem that there
is little hope for the solutions of modularity maximizing
algorithms to be applied in real situations in which
$L \gg l_s$. Indeed, partially due to the resolution limit
result, the general direction of research in community
detection seems to have shifted away from modularity
maximization in favor of machine learning techniques.

In this paper, we revisit the resolution limit in the con- 
text of edge weighting and derive more positive results. 



III. RESOLUTION WITH EDGE WEIGHTS 

The definition of a module in equation (1) can easily be
generalized when edges have weights. Let $w_s$ be the sum
of the weights of all undirected edges connecting vertices
within set $s$, and let $W$ be the total edge weight in the
network. Let $d^w(v)$, the weighted degree of vertex
$v$, be the sum of the weights of all edges incident on $v$.
We define $d_s^w = \sum_{v \in s} d^w(v)$ to be the sum of weighted
degrees of the vertices in set $s$. Then set $s$ is a module
if and only if:

$$\frac{w_s}{W} - \left(\frac{d_s^w}{2W}\right)^2 > 0. \tag{3}$$



Following [6] step by step, when considering a module,
we use $w_s^{out}$ to denote the sum of the weights of the edges
leaving set $s$, and also note that $w_s^{out} = a_s w_s$, where $a_s$ is
a convenience that enables us to rewrite the definition of a
module in a useful way. We now have $d_s^w = 2w_s + w_s^{out} =
(a_s + 2)w_s$, and a new, equivalent, definition of a module:

$$\frac{w_s}{W} - \left(\frac{(a_s + 2)\,w_s}{2W}\right)^2 > 0. \tag{4}$$



Manipulating the inequality, we obtain the relationship:

$$w_s < \frac{4W}{(a_s + 2)^2}. \tag{5}$$



Thus, sets representing communities must not have too 
much weight in order to be modules. 
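
As a quick numerical check, the sketch below evaluates both the weighted module test $w_s/W - (d_s^w/2W)^2 > 0$ and the equivalent bound $w_s < 4W/(a_s+2)^2$ on a small example. It is illustrative only: the `(u, v, weight)` edge format, the function names, and the toy weights (1.0 inside each triangle, 0.1 on the connecting edge) are our own assumptions.

```python
def module_terms(wedges, vertex_set):
    """Return (W, w_s, w_out) for a list of undirected (u, v, weight) edges."""
    members = set(vertex_set)
    W = sum(w for _, _, w in wedges)
    w_s = sum(w for u, v, w in wedges if u in members and v in members)
    w_out = sum(w for u, v, w in wedges if (u in members) != (v in members))
    return W, w_s, w_out


def is_weighted_module(wedges, vertex_set):
    """Direct test: w_s/W - (d_s^w / 2W)^2 > 0 with d_s^w = 2 w_s + w_out."""
    W, w_s, w_out = module_terms(wedges, vertex_set)
    d_s = 2.0 * w_s + w_out
    return w_s / W - (d_s / (2.0 * W)) ** 2 > 0


def satisfies_weight_bound(wedges, vertex_set):
    """Equivalent test: w_s < 4W / (a_s + 2)^2 with a_s = w_out / w_s."""
    W, w_s, w_out = module_terms(wedges, vertex_set)
    a_s = w_out / w_s  # assumes the set contains at least one internal edge
    return w_s < 4.0 * W / (a_s + 2.0) ** 2


# Hypothetical toy graph: two unit-weight triangles joined by a light edge.
wedges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
          (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
          (2, 3, 0.1)]
for s in ({0, 1, 2}, {2, 3}):
    print(sorted(s), is_weighted_module(wedges, s), satisfies_weight_bound(wedges, s))
```

Both tests agree on each set, as expected from the algebra: the triangle passes and the pair spanning the light edge fails.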



IV. THE MAXIMUM WEIGHTED MODULARITY

Fortunato and Barthelemy describe the most modular 
network possible. This yields both computed figures that 
can be corroborated by experimental evidence, and intu- 
ition that the resolution limit in community detection 
has a natural scale that is related to the total number of 
links in the network. We will use the same strategy for 
the weighted case. 

First, we imagine a network in which every module is
a clique. For a given number of nodes and number of
cliques, the modularity will be maximized if each clique
has the same size. Weighting does not change the argument
of [6] that the modularity approaches 1.0 as the
number of cliques goes to infinity. Now, following [6],
we consider a slight relaxation of the simple case above:
the most modular connected network. This will be our
set of $m$ cliques with at least $m - 1$ edges to connect
them. Without loss of generality, we consider the case
of $m$ connecting edges, that is, a ring of cliques, as studied
by [15].

Departing for a moment from [6], we now consider an
edge weighting for the network. With edge weights in the
range [0, 1], the optimal weighting would assign 1 to each
intra-clique edge and 0 to each connecting edge. The
weighted modularity of this weighted network would be
equivalent to the unweighted modularity of the $m$ independent
cliques described above, and would tend to 1.

Relaxing this idealized condition, now assume that we
have a weighting function that assigns $\epsilon$ to each connecting
edge, and 1.0 to each intra-clique edge. We now
analyze the resulting weighted modularity.

The total edge weight contained within the cliques is

$$\sum_{s=1}^{m} w_s = W - \epsilon m. \tag{6}$$



Each clique is a module by (5) provided that $\epsilon$ is sufficiently
small. Summing the contributions of the modules,
we find the weighted modularity of the network when
broken into these cliques is:

$$Q = \sum_{s=1}^{m} \left[ \frac{w_s}{W} - \left( \frac{2w_s + 2\epsilon}{2W} \right)^2 \right]. \tag{7}$$

Since all modules contain the same weight, for all $s$

$$w_s = \frac{W - \epsilon m}{m} = \frac{W}{m} - \epsilon. \tag{8}$$



The maximum modularity of any solution with $m$ communities is:

$$Q_M(m, W) = m \left[ \frac{W/m - \epsilon}{W} - \left( \frac{W/m}{W} \right)^2 \right] = 1 - \frac{\epsilon m}{W} - \frac{1}{m}. \tag{9}$$

To quantify this maximum, we take the derivative with
respect to $m$:

$$\frac{dQ_M}{dm} = -\frac{\epsilon}{W} + \frac{1}{m^2}. \tag{10}$$

Setting this to zero, we find the number of communities
in the optimal solution:

$$m^* = \sqrt{\frac{W}{\epsilon}}. \tag{11}$$

Substituting into (9), we find the maximum possible
weighted modularity:

$$Q_M(W) = 1 - 2\sqrt{\frac{\epsilon}{W}}. \tag{12}$$



The unweighted versions of equations (11) and (12) from
[6] are, respectively, $m^* = \sqrt{L}$ and $Q_M(L) = 1 - 2/\sqrt{L}$. In
this unweighted case, the natural scale is clearly related
to $L$. We do not expect to be able to find many more
than $\sqrt{L}$ modules in any solution of optimal unweighted
modularity.

Our weighted case is similar, but the introduction of $\epsilon$
leads to some intriguing possibilities. If $\epsilon$ can be made
small enough, for example, then there is no longer any
limit to the number of modules we might expect in any
solution of maximum weighted modularity.
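
The effect of $\epsilon$ on the optimal number of communities can be explored numerically. The sketch below assumes that the ring-of-cliques modularity takes the closed form $Q(m) = 1 - \epsilon m / W - 1/m$, which is what results from substituting $w_s = W/m - \epsilon$ and $d_s^w = 2w_s + 2\epsilon$ into the weighted module expression. The function names and the values $W = 400$, $\epsilon \in \{1, 0.01\}$ are hypothetical and chosen only to keep the arithmetic simple.

```python
import math


def ring_modularity(m, W, eps):
    """Weighted modularity of a ring of m equal cliques (assumed closed form)."""
    return 1.0 - eps * m / W - 1.0 / m


def optimal_community_count(W, eps):
    """Stationary point of ring_modularity in m: m* = sqrt(W / eps)."""
    return math.sqrt(W / eps)


def max_ring_modularity(W, eps):
    """Value of ring_modularity at m*: 1 - 2 * sqrt(eps / W)."""
    return 1.0 - 2.0 * math.sqrt(eps / W)


W = 400.0
for eps in (1.0, 0.01):
    m_star = optimal_community_count(W, eps)
    # Brute-force search over integer m as a sanity check on the formula.
    best_m = max(range(1, 2001), key=lambda m: ring_modularity(m, W, eps))
    print(eps, m_star, best_m, max_ring_modularity(W, eps))
```

Lowering $\epsilon$ by a factor of 100 raises the optimal number of communities by a factor of 10, which is the scaling that edge weighting is intended to exploit.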



V. THE WEIGHTED RESOLUTION LIMIT 

In [6], Fortunato and Barthelemy prove that any module
in which $l < \sqrt{L/2}$ may not be resolved by algorithms
that maximize modularity. Their argument characterizes
the condition under which two true modules linked
to each other by any positive number of edges will contribute
more to the global modularity as one unit rather
than as two separate units. This result is corroborated
by experiment. In a large real-world dataset such as the
WWW, modules with $l \ll L$ will almost certainly exist.

Following the arguments of [6] directly, while considering
edge weights, we now argue that any module $s$ in
which

$$w_s < \sqrt{\frac{W\epsilon}{2}} - \epsilon \tag{13}$$

may not be resolved. Consider a scenario in which two
small modules are either merged or not. Suppose that the
first module has intra-module edges of net weight $w_1$, and
the second has intra-module edges of net weight $w_2$. We
assume that inter-module edges between these two modules
have weight $\epsilon$, explicitly write the expressions for
weighted modularity in both cases, and find their difference.
The weighted modularity of the solution in which
these two modules are merged exceeds that in which
they are resolved, provided:

$$w < \frac{2W\epsilon/w}{\left(\frac{\epsilon}{w} + \frac{\epsilon}{w} + 2\right)\left(\frac{\epsilon}{w} + \frac{\epsilon}{w} + 2\right)}, \tag{14}$$

where $w$ could be either $w_1$ or $w_2$. Manipulation of this
expression gives (13).



Two challenges remain: finding a method to set edge
weights that achieve a small $\epsilon$, and adapting modularity
maximization algorithms to use weights. The second
challenge is partially addressed in prior work [18], but we
take a different approach.
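
The trade-off between merging and resolving neighboring modules can be checked directly on a small ring of cliques. The sketch below is illustrative only: it builds a ring of 30 five-vertex cliques (a hypothetical size chosen to keep the example small), assigns weight 1.0 to intra-clique edges and `eps` to connecting edges, and compares the weighted modularity of one community per clique against one community per pair of adjacent cliques. All function names are our own.

```python
def ring_of_cliques(m, n, eps):
    """Weighted edge list for m cliques of n vertices joined in a ring."""
    wedges = []
    for i in range(m):
        base = i * n
        wedges += [(base + a, base + b, 1.0)
                   for a in range(n) for b in range(a + 1, n)]
        # One connecting edge from this clique to the next one around the ring.
        wedges.append((base + n - 1, ((i + 1) % m) * n, eps))
    return wedges


def weighted_modularity(wedges, community_of):
    """Sum over communities of w_s/W - (d_s^w / 2W)^2."""
    W = sum(w for _, _, w in wedges)
    inside, degree = {}, {}
    for u, v, w in wedges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0.0) + w
        degree[cv] = degree.get(cv, 0.0) + w
        if cu == cv:
            inside[cu] = inside.get(cu, 0.0) + w
    return sum(inside.get(c, 0.0) / W - (d / (2.0 * W)) ** 2
               for c, d in degree.items())


def grouped_assignment(m, n, g):
    """Place every g consecutive cliques in one community (g must divide m)."""
    return {i * n + j: i // g for i in range(m) for j in range(n)}


m, n = 30, 5
for eps in (1.0, 0.01):
    wedges = ring_of_cliques(m, n, eps)
    q_resolved = weighted_modularity(wedges, grouped_assignment(m, n, 1))
    q_merged = weighted_modularity(wedges, grouped_assignment(m, n, 2))
    print(eps, round(q_resolved, 4), round(q_merged, 4))
```

With unit connecting weights the merged assignment scores higher, while with light connecting edges the assignment that resolves every clique scores higher.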

VI. EDGE WEIGHTING

There are myriad ways to identify local structure with
local computations. Several approaches to community
detection, such as [3, 17, 15], are based upon this idea.
We use local computations to derive new edge weights.
Our approach is to reward an edge for each short cycle
connecting its endpoints. Such cycles suggest strong
interconnections.




FIG. 1: Edge neighborhood weighting 

For a vertex $v$, let $E(v)$ be the set of all undirected
edges incident on $v$. We also define the following sets
to express triangle and rectangle relationships between
pairs of edges.

$T_e = \{e' : \text{there exists a 3-cycle containing both } e \text{ and } e'\}$

$R_e = \{e' : \text{there exists a 4-cycle containing both } e \text{ and } e'\}$

Note that $e$ can be a member of $T_e$ and $R_e$.

The total weight of edges incident on the endpoints of
edge $e = (u, v)$ is

$$W_e = \sum_{e' \in E(u) \cup E(v)} w_{e'}.$$



We consider incident edges that reside on paths of at most
three edges connecting the endpoints of $e$ to be "good"
with respect to $e$:

$$G_e = \sum_{e' \in (E(u) \cup E(v)) \cap (T_e \cup R_e)} w_{e'}.$$

Such edges add credence to the proposition that $e$ is an
intra-community edge. We define neighborhood coherence
of $e$ as follows:

$$C(e) = \frac{G_e}{W_e}.$$



For example, in Figure 1 the coherence is computed
by summing the weights of the thickened edges and dividing
by the total weight of edges incident on the endpoints
of $e$. Alternate definitions are possible, of
course, but this weighting is intuitive and performs well
in practice.

Arenas, Fernandez, and Gomez, by contrast, add self-loops
to vertices according to their $r$ parameter, thereby
"weighting" the nodes, and also adding more intra-community
edges to each module. Thus, they pack more
edges into each module in order to satisfy Inequality (2).

We have considered generalizing $C(e)$ to include cycles
of length 5 and greater, but this would be a considerable
computational expense, and we expect diminishing
marginal benefit.

Now we give a simple iterative algorithm for computing
edge weights:

1. Set $w_e = 1.0$ for each edge $e$ in the network (or accept
$w_e$ as input if the edges are already weighted).

2. Compute $C(e)$ for each $e$, and set $w_e = C(e)$.

3. If any $w_e$ changed by more than some tolerance, go to
Step 2.

This process will tend to siphon weight out of the inter-module
edges (those with smaller $C(e)$), thus lowering $\epsilon$.
We find in practice that it terminates in a small number
of iterations. Computing $C(e)$ reduces to finding the
triangles and 4-cycles in the graph. This can be done
naively in $O(mn \log n)$ time on scale-free graphs. We use
Cohen's data structures [19] that admit more efficient
algorithms in practice. For WWW-scale graphs, it may
be necessary for efficiency reasons to ignore edges incident
on high-degree vertices. This would isolate these
vertices. However, since such vertices often have special
roles in real networks, they might require individual
attention anyway.

We define Algorithm $W(k)$ to be $k$ iterations through
the loop in Steps 2-3.
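
The weighting procedure above can be sketched in a few lines. This is a naive illustration under our own assumptions, not the implementation used in the experiments: it handles a simple undirected graph given as an edge list, starts from unit weights, enumerates short cycles by direct neighborhood intersection rather than with specialized data structures, and runs a fixed number of iterations `k` instead of testing a tolerance. The names `edge_key`, `coherence_step`, and `algorithm_w` are hypothetical.

```python
from collections import defaultdict


def edge_key(a, b):
    """Canonical dictionary key for the undirected edge {a, b}."""
    return (a, b) if a <= b else (b, a)


def coherence_step(adj, w):
    """One pass of Step 2: return C(e) for every edge under weights w."""
    new_w = {}
    for (u, v), w_e in w.items():
        total = w_e        # e itself is incident on its own endpoints
        good = 0.0
        on_cycle = False
        for a, b in ((u, v), (v, u)):
            for x in adj[a]:
                if x == b:
                    continue
                w_ax = w[edge_key(a, x)]
                total += w_ax
                # Edge (a, x) shares a triangle a-x-b or a 4-cycle a-x-y-b with e.
                if x in adj[b] or any(y != a for y in adj[x] & adj[b]):
                    good += w_ax
                    on_cycle = True
        if on_cycle:
            good += w_e    # e lies on the same short cycle
        new_w[(u, v)] = good / total if total > 0 else 0.0
    return new_w


def algorithm_w(edges, k):
    """Run k coherence iterations starting from unit edge weights."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    w = {edge_key(u, v): 1.0 for u, v in edges}
    for _ in range(k):
        w = coherence_step(adj, w)
    return w


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
for edge, weight in sorted(algorithm_w(edges, 1).items()):
    print(edge, weight)
```

On this toy graph a single iteration already drives the weight of the connecting edge to zero, because none of its incident edges shares a 3-cycle or 4-cycle with it.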



VII. WEIGHTED CLAUSET-NEWMAN-MOORE 

Any modularity maximization algorithm could be
made to leverage edge weights such as those computed
in the previous section. Newman replaces individual
weighted edges with sets of multiple edges, each with
integral weight [18]. We modify the agglomerative algorithm
of Clauset, Newman, and Moore (CNM) [5] to
handle arbitrary weights directly.

The CNM algorithm efficiently computes the change
in modularity ($\Delta Q$) associated with all possible mergers
of two existing communities. At the beginning, each
vertex is in its own singleton community. Unweighted
modularity is defined as follows:

$$Q = \frac{1}{2L} \sum_{v,w} \left[ A_{vw} - \frac{k_v k_w}{2L} \right] \delta(c_v, c_w) = \sum_{s} \left( e_{ss} - a_s^2 \right),$$

where $A_{vw}$ is the adjacency matrix entry for directed edge
$(v, w)$, $k_v$ is the degree of vertex $v$, $e_{rs}$ is the fraction
of edges that link vertices in community $r$ to vertices in
community $s$, and $a_s = \sum_r e_{rs}$ is the sum of the degrees
of all vertices in community $s$ divided by the total degree.
The function $\delta(c_v, c_w)$ equals 1 if $v$ and $w$ are in the same
community, and 0 otherwise.

Since vertices $i$ and $j$ initially reside in their own singleton
communities, $e_{ij}$ is initially simply $A_{ij}/(2L)$. The first
step in CNM is to initialize $\Delta Q$ for all possible mergers:

$$\Delta Q_{ij} = \begin{cases} 1/L - 2k_i k_j/(2L)^2 & \text{if } i, j \text{ are connected} \\ 0 & \text{otherwise.} \end{cases} \tag{15}$$



Algorithm    $\epsilon$     $m^*$    $|S|$    $Q_M$     $Q$
CNM          N.A.           108      108      0.980     0.980
wCNM_1       0.111          286      263      0.9930    0.9928
wCNM_5       < 0.000001     1000     1000     0.9999    0.9986

TABLE I: These results from the ring of 1000 5-cliques illustrate
gains made by considering weighting. Predicted ($m^*$)
and algorithmically discovered ($|S|$) numbers of communities
match well and indicate that careful weighting makes it possible
to resolve all 1000 cliques as modules in a solution of
maximal weighted modularity. $Q_M$ is defined in (12), $m^*$ is
defined in (11), and $\epsilon$ is the weight of the heaviest edge
between two communities.

CNM also initializes $a_i = \frac{k_i}{2L}$ for each vertex $i$. Once
the initializations are complete, the algorithm repeatedly
selects the best merger, then updates the $\Delta Q$ and $a$
values, until only one community remains. The solution
is the community assignment with the largest value of $Q$
encountered during this process. Clever data structures
allow efficient update of the $\Delta Q$ values.

To modify CNM to work on weighted graphs, we need
only change the initialization step. The update steps are
identical. We simply define and compute the weighted
degree of each vertex, $k_i^w = \sum_j w_{ij}$. The initialization
becomes:

$$a_i = \frac{k_i^w}{2W}$$

and

$$\Delta Q_{ij}^w = \begin{cases} w_{ij}/W - 2k_i^w k_j^w/(2W)^2 & \text{if } i, j \text{ are connected} \\ 0 & \text{otherwise.} \end{cases} \tag{16}$$

With these initializations, normal CNM
merging greedily maximizes weighted modularity $Q^w$.
We refer to this algorithm as wCNM. Note that our definition
of $Q^w$ is equivalent to that of [4].
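
The weighted initialization can be sketched as follows. This is an illustrative fragment under our own assumptions, not the wCNM implementation itself: it covers only the initialization of $a_i = k_i^w / 2W$ and $\Delta Q_{ij}^w = w_{ij}/W - 2 k_i^w k_j^w / (2W)^2$ for connected pairs, and it omits the merging loop and the data structures that make CNM efficient. The function name `init_wcnm` and the `(u, v, weight)` edge format are hypothetical.

```python
from collections import defaultdict


def init_wcnm(wedges):
    """Return (a, delta_q) for the weighted initialization step.

    wedges is a list of undirected (u, v, weight) edges with no duplicates.
    Pairs that are not connected have delta_q = 0 and are not stored.
    """
    W = sum(w for _, _, w in wedges)
    k = defaultdict(float)          # weighted degree of each vertex
    for u, v, w in wedges:
        k[u] += w
        k[v] += w
    a = {i: k_i / (2.0 * W) for i, k_i in k.items()}
    delta_q = {(u, v): w / W - 2.0 * k[u] * k[v] / (2.0 * W) ** 2
               for u, v, w in wedges}
    return a, delta_q


# With unit weights this reduces to the unweighted initialization.
unit_edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
              (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
              (2, 3, 1.0)]
a, delta_q = init_wcnm(unit_edges)
print(a[0], delta_q[(0, 1)], delta_q[(2, 3)])
```

With all weights equal to 1.0, $W = L$ and the values coincide with the unweighted rule $1/L - 2k_ik_j/(2L)^2$, so running the weighted variant on an unweighted graph reproduces ordinary CNM initialization.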



VIII. RESULTS 

Given an undirected, weighted or unweighted network,
we apply the Algorithm $W(k)$ to set our edge weights,
then run wCNM. We use wCNM_k to denote this two-step
process. Note that running wCNM_0 is equivalent to
running CNM.



We will consider two different datasets: the ring of
cliques example discussed above, and the benchmark
of [20], which is a generalization of the 128-node benchmark
of Girvan and Newman [21].



A. The ring of cliques 

Refer to Table I for the following discussion. Danon,
Díaz-Guilera, Duch, and Arenas [15] considered $m$ disconnected
cliques as a pathological example of maximum
modularity (which approaches 1.0 as the number
of cliques increases). Fortunato and Barthelemy [6] add
single connections between cliques to form a ring. Our
intuition is that the natural communities in such a graph
are the cliques. However, the resolution limit argument
of Fortunato and Barthelemy indicates that this will not
be the solution of maximum modularity if each clique has
fewer than $\sqrt{L/2}$ edges. They confirm this via experiment,
and we have reproduced their results for an instance with
1000 cliques of size five. Table I summarizes the performance
of CNM and wCNM for this case. The $m^*$ column
contains the number of communities expected in a solution
of maximum weighted modularity, as defined in (11).
The first row shows the unweighted case, in which $m^*$
is equivalent to that defined in [6]. CNM achieves this
theoretical maximum by finding 108 communities, which
is much smaller than the number of cliques.

If we run wCNM_1, which performs one iteration of
neighborhood coherence, we obtain the results in Row
2 of Table I. The value of $\epsilon$ we observe is 0.047, leading
via (11) to an estimate of 286 resolved communities.
The wCNM_1 algorithm resolves 263. In a run with
five iterations, labeled wCNM_5, we both expect and find
1000 communities, resolving all of the natural communities
and simultaneously observing our highest weighted
modularity. Iterating further reduces $\epsilon$ without changing
the community assignment.



B. The LFR Benchmark 

Lancichinetti, Fortunato, and Radicchi [20] (LFR)
give a generalization of the popular Girvan and Newman
benchmark [21] for evaluating community detection
algorithms. The latter consists of 128-vertex random
graphs, each with 4 natural communities of size 32.
The user tunes a parameter to adjust the numbers of
intra-community and inter-community edges. Many authors
use this benchmark to create plots of "mutual information,"
or agreement in node classification between
algorithm-discovered communities and natural communities.
The LFR benchmark is similar in spirit, but considerably
more realistic. It allows the user to specify distributions
both for the community sizes and the vertex
degrees. Users also specify the average ratio (per vertex)
of inter-community adjacencies to total adjacencies,
called the mixing parameter $\mu$. At $\mu = 0.0$, all edges are
intra-community.

FIG. 2: Mutual information study for the LFR benchmark (horizontal axis: LFR mu parameter, 0.1 to 0.5).

The LFR benchmark construction process begins by
sampling vertex degrees and creating a graph with the
selected degree distribution. It then samples community
sizes. A vertex of degree $k$ should have about $(1 - \mu)k$
neighbors from the same community. Therefore, it is assigned
to a community with at least $(1 - \mu)k + 1$ vertices.
LFR assigns vertices to communities via an iterated random
process enforcing this constraint, then rewires until
the average $\mu$ meets the desired value. We have a special
interest in the LFR benchmark because it generates
graphs with both small and large natural communities.

For several different values of $\mu$, we used the C code
from Fortunato's web site (cited in [20]) to generate 30
instances each of LFR benchmark graphs, each with 5000
vertices and average degree 8. The community sizes were
selected from the power-law distribution $f(k) \propto k^{-1.5}$,
with $k \in [10, 105]$. The degree distribution was also a
power law, with $k \in [2, 50]$. We specified an average degree of
8, which is roughly comparable to that of the WWW.

Figure 2 contains the mutual information plot for our
experiments with LFR. Our metric for comparison is the
Jaccard index [22]:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|},$$

where $A$ is the set of intra-community edges in the LFR
ground truth, and $B$ is the set of intra-community edges
in an algorithm solution. As predicted by the resolution
limit argument, CNM, an unweighted modularity maximization
algorithm, is not able to resolve most of the
natural communities. However, even with these more realistic
data, wCNM achieves greater accuracy than the
sophisticated HQcut algorithm. This is notable, considering
the reputation for poor accuracy recently associated
with agglomerative algorithms such as CNM and its variants [23].
The accuracy of our CNM variant, on the other
hand, is competitive.
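
The comparison metric is simple to compute once both community assignments are available. The sketch below is illustrative only: it assumes each assignment is a dictionary from vertex to community label, and the helper names and toy data are hypothetical.

```python
def intra_community_edges(edges, community_of):
    """Set of edges whose two endpoints carry the same community label."""
    return {frozenset((u, v)) for u, v in edges
            if community_of[u] == community_of[v]}


def jaccard_index(a, b):
    """|A intersect B| / |A union B|, defined as 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
truth = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {v: "all" for v in range(6)}   # a solution that fails to resolve

A = intra_community_edges(edges, truth)
B = intra_community_edges(edges, merged)
print(jaccard_index(A, A), jaccard_index(A, B))
```

A solution that merges the two triangles counts the connecting edge as intra-community, so its score against the ground truth drops from 1 to 6/7.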

We observe for these data that iterating the neighbor- 
hood coherence weighting provides diminishing marginal 
returns. However, as we show below, such iteration does 
add value. 

In addition to the mutual information, we wish to com- 
pare the distributions of the sizes of communities discov- 
ered by CNM and its weighted variants to the original 
distributions used in LFR generation. It is a challenge 
to fit empirical data to heavy-tailed power-law distribu- 
tions. However, the discrete power-law distribution of 
community sizes used by LFR is not heavy-tailed. LFR 
uses the following precise sampling process to determine 
ground truth community sizes: 

1. Compute $k^{-\tau}$, the (unnormalized) probability that a community
will have size $k$.

2. For all $a \leq k \leq b$, where $a$ and $b$ bound the community
sizes, compute the cumulative distribution
function for $k$: $p_k = \sum_{i=a}^{k} i^{-\tau} / \sum_{i=a}^{b} i^{-\tau}$.

3. For a uniform random variate $x \in [0, 1]$, find the
minimum $k'$ such that $p_{k'} > x$.

This process continues until the sum of the community 
sizes exceeds the number of vertices, and the final com- 
munity is truncated. 
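
The sampling steps above amount to inverse-transform sampling from a bounded discrete power law. The following sketch is our own illustration of that process, not the LFR reference code; the function name, the use of a seeded `random.Random`, and the parameter values in the usage line are assumptions.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def sample_community_sizes(n_vertices, a, b, tau, rng):
    """Draw community sizes in [a, b] until they cover n_vertices vertices."""
    sizes = list(range(a, b + 1))
    pmf = [k ** -tau for k in sizes]
    total = sum(pmf)
    cdf = list(accumulate(p / total for p in pmf))
    result = []
    remaining = n_vertices
    while remaining > 0:
        x = rng.random()
        # Smallest index whose cumulative probability exceeds x; the clamp
        # guards against floating-point round-off in the last entry.
        index = min(bisect_right(cdf, x), len(sizes) - 1)
        size = min(sizes[index], remaining)   # truncate the final community
        result.append(size)
        remaining -= size
    return result


rng = random.Random(0)
sizes = sample_community_sizes(5000, 10, 105, 1.5, rng)
print(len(sizes), sum(sizes), min(sizes[:-1]), max(sizes))
```

Only the last community can fall below the lower bound, since it is the one truncated to make the sizes sum to the number of vertices.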

We approach the problem of testing goodness-of-fit of 
sets of algorithm-generated community sizes by generat- 
ing visualizations and performing hypothesis tests. In 
both cases, we compare the empirical distributions of 
community sizes with the untruncated discrete power- 
law distribution that underlies the LFR distribution. 

For visualization, we generate quantile-quantile plots
using the R language [24] and its quantile() function with
interpolation type 8. This is the recommendation of Hyndman
and Fan [25]. Figure 3 shows three such plots: one
LFR instance each of $\mu$ values 0.1, 0.3, and 0.5. With the
moderate community coherence of $\mu = 0.3$, the wCNM
variants track the target distribution closely, show a drastic
improvement over CNM, and appear to dominate HQcut.
This latter claim is corroborated by the hypothesis
tests described below. At $\mu = 0.5$, the advantage over
CNM is still clear, but neither wCNM nor HQcut tracks
the target distribution closely.

FIG. 3: Example distributions of community sizes are shown in these quantile-quantile plots, with panels (a) mu = 0.1, (b) mu = 0.3, and (c) mu = 0.5, each plotted against target community size. The line $y = x$ represents a
perfect match between discovered community sizes and the LFR power-law distribution.

To augment our results with statistical evidence, we
use the classical Kolmogorov-Smirnov (K-S) test as described,
for example, in [26]. Our null hypothesis is that
the algorithm-generated community sizes follow a discrete
power-law with $\tau = 1.5$. We computed critical
values for each sample size between 10 and 290. The
former sometimes occurs in CNM output because of the
resolution limit, and the latter sometimes occurs in HQcut
output as its stopping criterion encourages splitting
communities with high modularity. The average number
of target communities in our LFR instances is roughly
150. For each sample size, the critical value is the 95th
percentile of computed K-S statistic values. We used
100,000 trials per sample size.
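
The Monte Carlo derivation of critical values can be sketched as follows. This is a simplified illustration under our own assumptions rather than the code used for the reported tests: it draws samples from a bounded discrete power law by inverse-transform sampling, measures the largest gap between the empirical and model cumulative distributions, and takes an empirical percentile. The function names are hypothetical, and the usage line runs only 200 trials to keep the example fast.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def power_law_cdf(a, b, tau):
    """Cumulative distribution of a discrete power law on the integers a..b."""
    pmf = [k ** -tau for k in range(a, b + 1)]
    total = sum(pmf)
    return list(accumulate(p / total for p in pmf))


def ks_statistic(sample, a, cdf):
    """Largest gap between the empirical CDF of sample and the model CDF."""
    b = a + len(cdf) - 1
    ordered = sorted(sample)
    n = len(ordered)
    gap = 0.0
    # Both CDFs are step functions, so it suffices to check every jump point.
    for k in sorted(set(ordered) | set(range(a, b + 1))):
        empirical = bisect_right(ordered, k) / n
        model = 0.0 if k < a else 1.0 if k > b else cdf[k - a]
        gap = max(gap, abs(empirical - model))
    return gap


def critical_value(n, a, b, tau, trials, rng, level=0.95):
    """Empirical level-quantile of the K-S statistic for samples of size n."""
    cdf = power_law_cdf(a, b, tau)
    stats = []
    for _ in range(trials):
        sample = [a + min(bisect_right(cdf, rng.random()), b - a)
                  for _ in range(n)]
        stats.append(ks_statistic(sample, a, cdf))
    stats.sort()
    return stats[min(int(level * trials), trials - 1)]


rng = random.Random(0)
print(critical_value(n=150, a=10, b=105, tau=1.5, trials=200, rng=rng))
```

A set of discovered community sizes would then be rejected at this level when its K-S statistic exceeds the critical value for its sample size.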

After computing critical values, we evaluated the K-S
statistic for each of our trials at each value of $\mu$. If we reject
the null hypothesis then we have 95% confidence that
the algorithm results do not follow the discrete power-law
distribution. Table II summarizes our results for all
instances, broken down by algorithm type and $\mu$ value.

Both Figure 3 and Table II expose a phenomenon we
call fracturing. We refer to the communities defined by
LFR as target communities. There is no guarantee that
target communities will be minimal natural communities.
In fact, the subgraph induced by a target community
is itself a random graph, and therefore we expect
these to contain minimal natural communities occasionally.
Modularity-based algorithms such as CNM, wCNM,
and HQcut will find these smaller communities when they
exist. In Table II, note that wCNM_5 fails more K-S tests
than does wCNM_3 with increasing $\mu$. As we add more
iterations to the edge weighting scheme described in Section VI,
we enable wCNM to resolve smaller communities.
The most plausible explanation for the increased K-S
failure rate of wCNM_5, holding $\mu$ constant, is that we
detect smaller communities whose sizes were not drawn
from the LFR power-law.

Figure 3(b) corroborates this observation. Note that
for wCNM_5, the quantile of target community size 10 corresponds
to that of discovered community size less than
5. Figure 3(c) shows that HQcut also finds communities
smaller than size 10.

Algorithms such as wCNM and HQcut ascribe hierar- 
chical community structure to a graph based on mod- 
ularity. Some members of a large collection of random 
graphs, such as the LFR target communities, will have 
statistically significant sub-communities. Lang [27] uses 
an information theoretic metric to distinguish random 
graphs from those with community substructure. We 
conjecture that Lang's method will judge some LFR tar- 
get communities to be non-random. Modularity-based 
algorithms will find substructure in these cases. 





                        LFR $\mu$
Algorithm    0.1      0.2      0.3      0.4      0.5
CNM          0/29     0/30     0/30     0/30     0/29
wCNM_1       17/29    0/30     0/30     0/30     0/29
wCNM_3       28/29    29/30    29/30    23/30    0/29
wCNM_5       28/29    30/30    14/30    0/30     0/29
HQcut        12/29    5/30     2/30     2/30     0/29

TABLE II: This table shows Kolmogorov-Smirnov (K-S)
results for experiments with 5000-vertex LFR instances
(#passed tests/#instances). The critical values for the test
were derived empirically by computing the K-S statistic for
100,000 samples, for each possible sample size between 10 and
290 communities. The hypothesis test results presented are
at the 95% confidence level.



We have not included formal running-time comparisons
since Ruan and Zhang's publicly available HQcut
implementation is in Matlab and our implementation
of wCNM is in C/C++. For anecdotal purposes,
the wCNM runs on our 5000-vertex LFR instances took
roughly 10 s on a 3 GHz workstation, even with several
iterations of weighting. The HQcut instances took 5-10
minutes on the same machine, though there were instances
that took many hours. We killed such instances,
and that is why we sometimes present fewer than 30 instances
of HQcut results per $\mu$.



IX. CONCLUSIONS 

We agree with Arenas, Fernandez, and Gomez [3] that
it may be premature to dismiss the idea of modularity
maximization as a technique for detecting small communities
in large networks. Our weighted analogue to
Fortunato and Barthelemy's resolution argument leaves
open the possibility for much greater community resolution,
given proper weighting. Furthermore, our simple
adaptation of the CNM heuristic, when combined with
a careful computation of edge weights, is able to resolve
communities of varying sizes in test data. In addition,
we have given empirical evidence that the true ability of
such techniques to resolve small, local communities may
be greater than that suggested by analysis.

Arguably, the original, unweighted CNM already pro- 
vides output that could help mitigate the resolution limit. 
This agglomerative heuristic constructs a dendrogram of 
hierarchical communities, and therefore does recognize 
small communities as modules before merging them into 
larger communities. In this sense, these small communi- 
ties actually are "resolved" - they are stored in the den- 
drogram included in the CNM output. A cut through this 
dendrogram defines the community assignments. The 
resolution limit leads us to expect that the communi- 
ties defined by this cut will be unnaturally large. One 
potential research direction would be to mine this den- 
drogram for the true communities. In effect, this would 
mean ignoring the cut provided by CNM, and therefore 
abandoning the idea of maximizing modularity. 

Our wCNM heuristic likewise produces a dendrogram
and a cut through that dendrogram defining communities.
However, the cut provided by wCNM is much deeper
and more uneven. It is analogous to the potential result
of mining the CNM dendrogram for natural communities,
yet the tie with modularity is maintained since wCNM's
solution exhibits a maximal weighted modularity.

The edge weighting we describe is just one of many 
possible alternatives, and wCNM is just one of many po- 
tential weighted modularity algorithms. The main con- 
tribution of this paper is to spread awareness that reso- 
lution limits may in fact be tolerated while retaining the 
advantages of modularity maximization and the efficient 
algorithms for this computation. 